
Exact Data Match Database

The EDM feature uses a compiled database of hashed versions of the sensitive data that you provide and monitors your data traffic for matching hashed values. To create the EDM database while keeping the original content secure, Cyberhaven provides a tool that creates the hashed EDM database rows used to match the inspected content.

Exact Data Match Command Line Interface Tool

The EDM CLI Tool is a Python-based utility offered by Cyberhaven to help you easily create a hash for the sensitive data you want to detect in your data flow. This tool is deployed on-premises to ensure that only the hash of your sensitive data is sent to Cyberhaven.

The EDM CLI tool works independently from the Cyberhaven Sensor and can be run on any machine where your sensitive data resides. You do not need to install the Sensor on the same machine that runs this tool. This tool provides you with full control over your sensitive data. It processes data by hashing it locally on your machine, ensuring no plain text data leaves your system.

Cyberhaven decided to use Python for developing this tool with the intention of providing you with transparency and control over the entire process of hashing and uploading. This means you can directly oversee how your confidential data is handled.

You can run this utility on a machine where you have access to the raw data. The Cyberhaven Sensor does not have to be installed on this machine because the tool is independent of the Sensor. We made a deliberate choice to operate only on hashed data. Hashing is performed on your system, and the Cyberhaven Console ingests only hashed data, not your plain text data, which may be confidential. Moreover, the tool is implemented in Python so that you can fully review the hashing and uploading process and remain in full control of how your plain text data is processed.
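To illustrate what local hashing means in general, here is a minimal sketch using Python's standard hashlib module. It is not the EDM CLI tool's actual encoding, which may normalize or combine values differently; the function name sha256_hex and the sample value are hypothetical.

```python
import hashlib


def sha256_hex(value: str) -> str:
    """Return the SHA-256 digest of a single cell value as a hex string."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Hypothetical cell value, not real data. Only the digest would leave the machine.
digest = sha256_hex("example-account-0001")
print(len(digest))  # a SHA-256 hex digest is always 64 characters long
```

The same input always yields the same digest, which is what makes matching on hashed values possible without exposing the plain text.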

Installation of the EDM CLI Tool

Before you begin the installation process, ensure the following:

The Python version is 3.7 or greater.

You have “pip” and “make” installed on your machine.

The data you want to monitor is in CSV format.
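The first two prerequisites can be checked with a short script before you start. This is a sketch under simple assumptions: the function name missing_prerequisites is hypothetical, and it only checks that "pip" and "make" are discoverable on your PATH.

```python
import shutil
import sys


def missing_prerequisites(min_version=(3, 7), tools=("pip", "make")):
    """Return a list of problems found; the list is empty if all checks pass."""
    problems = []
    # Compare the running interpreter against the minimum version from this guide.
    if sys.version_info < min_version:
        problems.append("Python %d.%d or greater is required" % min_version)
    # shutil.which returns None when a command is not found on PATH.
    for tool in tools:
        if shutil.which(tool) is None:
            problems.append(tool + " was not found on PATH")
    return problems


for problem in missing_prerequisites():
    print(problem)
```

If the script prints nothing, the interpreter version and both tools were found.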

Installing the EDM CLI Tool

The installation process involves the following steps.

1. Download the source code

2. Start a new virtual environment

3. Create the installation archive

4. Install the EDM CLI tool

Follow this detailed procedure to install the EDM CLI tool.

1. Download the .zip file with the source code from the edm-cli repository on GitHub.

Go to the following GitHub location.

https://github.com/CyberhavenInc/edm-cli

2. Unzip the file, which includes all the necessary files to run the EDM CLI tool.

3. Start a new virtual environment to separately maintain all the dependencies required for the EDM CLI tool. Open a CLI and run the following command.

python -m venv .venv

4. Run the following command to activate the virtual environment.

source .venv/bin/activate

5. Create the installation archive (.tar.gz file) for the EDM CLI tool. Run the following “make” command.

make build

Running this command creates a "dist" directory where the .tar.gz file containing the EDM CLI tool will be located.

6. Run the following command to install the EDM CLI tool.

pip install ./dist/edmtool-<version>.tar.gz

See Hash Generation and Database Creation.

Hash Generation and Database Creation

Use the EDM CLI tool to do the following:

Generate a hashed database file from the source CSV file with the sensitive data you want to match.

Create a new database entry in the Cyberhaven Console and upload the hashed database file to the Console.

Hashing the data

To generate the database file with the hashed values, open a terminal window and navigate to the directory where your source CSV file is located, then run the following command.

edmtool encode --algorithm "sha256" --db_file_path ./path/to/your/file.csv --db_file_delimiter ","

The above command generates two encoded files for your CSV file:

<file-name>_encoded.csv: Contains the hash values representing the data in your CSV file.

<file-name>_encoded_metadata.json: Contains essential metadata such as the header information, file size, checksum, type of hashing algorithm used to encode the file, and the total number of rows in the CSV file.

** IMPORTANT **

Cyberhaven requires that the two encoded files are kept together and not modified or renamed.

If the CSV file contains empty cells, the encoding process excludes the rows that contain them and creates the encoded files from the remaining rows.

The source CSV file must not contain more than 10 million cells.
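Before encoding, you may want to check a source file against these constraints. The sketch below uses Python's standard csv module; the function name summarize_csv is hypothetical, and it assumes that every cell, including the header row, counts toward the limit, which this guide does not specify.

```python
import csv

MAX_CELLS = 10_000_000  # limit stated in this guide


def summarize_csv(path, delimiter=","):
    """Count all cells and the rows that contain at least one empty cell."""
    total_cells = 0
    rows_with_empty = 0
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle, delimiter=delimiter):
            total_cells += len(row)
            # A cell holding only whitespace is treated as empty here (assumption).
            if any(cell.strip() == "" for cell in row):
                rows_with_empty += 1
    return total_cells, rows_with_empty


def exceeds_limit(total_cells):
    """Return True if the file is over the documented cell limit."""
    return total_cells > MAX_CELLS
```

For example, calling summarize_csv on your source file returns the cell count and the number of rows that the encoding process would be expected to exclude.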

To see the complete list of CLI options, type edmtool --help at the CLI prompt. See EDM CLI Tool Options for more information.

Creating a database and uploading the hash

After you have generated the hashed database file, the next step is to create a database entry in the Cyberhaven Console and upload the hashed database file to the Console. The file you need to upload is <file-name>_encoded_metadata.json.

This process requires an API token from the Cyberhaven Console. The token provides the authentication required to perform actions in the Console such as creating a database, updating the database, and uploading files.

Follow these steps to generate the token and upload the hashed database file to the Cyberhaven Console.

1. Log in to the Cyberhaven Console, navigate to Preferences > API token management, and click Create New Token.

2. In the Token Creation pop-up window, enter a name to identify the token and then click Create.

3. Save the token to a notepad. The token cannot be recovered once you close the window.

4. Open your CLI and run the following command.

edmtool create_and_upload --name <name-of-the-database> --description <describe-the-database> --metadata_file_path "./bank accounts_encoded_metadata.json" --base_url <your-cyberhaven-url> --token <cyberhaven-api-token>

The above command uploads <file-name>_encoded_metadata.json to the Console and <file-name>_encoded.csv to the backend.
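If you script the upload, passing the arguments as a list avoids shell quoting problems with file names that contain spaces. The following is a sketch, not part of the EDM CLI tool: the function names and the CYBERHAVEN_API_TOKEN environment variable are hypothetical, and it assumes the encoded CSV file sits in the same directory as the metadata file. Only flags shown in this guide are used.

```python
import os
import subprocess
from pathlib import Path

METADATA_SUFFIX = "_encoded_metadata.json"


def encoded_csv_for(metadata_path):
    """Return the expected <file-name>_encoded.csv path beside the metadata file."""
    metadata_path = Path(metadata_path)
    if not metadata_path.name.endswith(METADATA_SUFFIX):
        raise ValueError("expected a file name ending in " + METADATA_SUFFIX)
    stem = metadata_path.name[: -len(METADATA_SUFFIX)]
    return metadata_path.with_name(stem + "_encoded.csv")


def build_create_command(name, description, metadata_path, base_url, token):
    """Build the create_and_upload argument list."""
    return [
        "edmtool", "create_and_upload",
        "--name", name,
        "--description", description,
        "--metadata_file_path", str(metadata_path),
        "--base_url", base_url,
        "--token", token,
    ]


def create_and_upload(name, description, metadata_path, base_url):
    """Check that both encoded files are present, then run edmtool."""
    token = os.environ["CYBERHAVEN_API_TOKEN"]  # hypothetical variable name
    if not Path(metadata_path).exists() or not encoded_csv_for(metadata_path).exists():
        raise FileNotFoundError("both encoded files must be kept together")
    command = build_create_command(name, description, metadata_path, base_url, token)
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
```

Reading the token from an environment variable keeps it out of your script file; the token is still passed to edmtool through the --token option, as in the command above.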

NOTE

Cyberhaven enforces a limit on the number of cells. The source CSV file must not contain more than 10 million cells. If you attempt to upload a file that exceeds this limit, the upload action will fail with an error.

To view the database that you created in the Cyberhaven Console, navigate to Preferences > Content matching rules and click on the Databases Management drop-down list.

Updating the database

If you have an updated CSV file and want to refresh the database entry in the Cyberhaven Console, then you must create a hash for the new CSV file and then update the database with the new encoded file.

To update the database and upload the new encoded file, you will require the following:

Database ID: Copy the database ID from the Cyberhaven Console.

API token: The API token you previously generated to create this database.

Follow these steps to copy the database ID and update the database.

1. In the Cyberhaven Console, navigate to Preferences > Content matching rules and click on the Databases Management drop-down list.

2. Mouse over the database you want to update. The ID is displayed as a tooltip.

3. Copy the database ID to the clipboard.

4. In the CLI, run the following command.

edmtool update_and_upload --id <database-id> --name <name-of-the-database> --description <describe-the-database> --metadata_file_path "./bank accounts_encoded_metadata.json" --base_url <your-cyberhaven-url> --token <cyberhaven-api-token>

When an update is applied to the database, the version number in the 'Databases Management' drop-down list will be incremented. The 'Updated At' column is also updated to display the date and time of the most recent change made to the database.

EDM CLI Tool Options

The EDM CLI tool includes the following options.

--id: Cyberhaven assigns a unique identifier to a database when it is created. The ID is essential when updating the database and uploading an encoded file.

--name: Name of the database you are creating. Allows a maximum of 30 characters.

--description: A summary to help you identify the database you are creating. Allows a maximum of 255 characters.

--algorithm: The type of algorithm used to create or update the hash for the CSV file containing your sensitive data. Options are SHA-256 and Spooky.

--db_file_path: The location of the source CSV file to be encoded.

--db_file_delimiter: The type of delimiter character or symbol used to separate the cells in your CSV file.

--metadata_file_path: The location of the encoded file, <file-name>_encoded_metadata.json. Cyberhaven uploads both encoded files, so ensure that they are kept together and not modified or renamed.

--token: The Cyberhaven API token necessary to authenticate with the Console to perform actions in the Console.

--base_url: The URL of your Cyberhaven Console in the format https://<tenant-name>.cyberhaven.io.
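The length limits and URL format listed above can be checked before you run the tool. This is a sketch only: the function name validate_options is hypothetical, and the pattern assumes tenant names consist of letters, digits, and hyphens, which this guide does not state.

```python
import re

NAME_MAX = 30          # maximum length for --name
DESCRIPTION_MAX = 255  # maximum length for --description
# Assumed tenant-name character set; adjust if your tenant name differs.
BASE_URL_PATTERN = re.compile(r"^https://[A-Za-z0-9-]+\.cyberhaven\.io/?$")


def validate_options(name, description, base_url):
    """Return a list of validation errors; the list is empty if all values pass."""
    errors = []
    if not name or len(name) > NAME_MAX:
        errors.append("--name must be 1 to %d characters" % NAME_MAX)
    if len(description) > DESCRIPTION_MAX:
        errors.append("--description allows at most %d characters" % DESCRIPTION_MAX)
    if not BASE_URL_PATTERN.match(base_url):
        errors.append("--base_url should look like https://<tenant-name>.cyberhaven.io")
    return errors
```

An empty result means the three values fit the documented limits; the tool and the Console may still apply checks of their own.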

Database Management menu

In the Cyberhaven Console, the Database Management drop-down list displays all the databases you have uploaded using the EDM CLI tool. For each database, the list displays the following details:

Name: The identifier of the database.

Active Rules: The number of rules that use this database and are actively being used within datasets.

Inactive Rules: The number of rules that have been created but are not being used in any dataset.

Version: Indicates how many times the database has been updated.

Updated At: Displays the date and time when the database was most recently updated.

Actions: Allows you to delete a database. However, databases that are currently in use within a dataset cannot be deleted.